On November 27, 1895, Alfred Nobel signed his last will in Paris. When it was opened after his death, the will caused a lot of controversy, as Nobel had left much of his wealth for the establishment of a prize.
Nobel's will dictated that his entire remaining estate should be used to endow “prizes to those who, during the preceding year, have conferred the greatest benefit to humankind”.
Every year the Nobel Prize is given to scientists and scholars in the categories chemistry, literature, physics, physiology or medicine, economics, and peace.

Let's see what patterns we can find in the data of the past Nobel laureates. What can we learn about the Nobel prize and our world more generally?
# pip install --upgrade plotly
# %pip install --upgrade plotly
import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.display.float_format = '{:,.2f}'.format
df_data = pd.read_csv('nobel_prize_data.csv')
Caveats: The exact birth dates for Michael Houghton, Venkatraman Ramakrishnan, and Nadia Murad are unknown. I've substituted them with a mid-year estimate of July 2nd.
Preliminary data exploration.
df_data: how many rows and columns does it have?
df_data.head(5)
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Netherlands | Male | Berlin University | Berlin | Germany | NLD |
| 1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | France | Male | NaN | NaN | NaN | FRA |
| 2 | 1901 | Medicine | The Nobel Prize in Physiology or Medicine 1901 | "for his work on serum therapy, especially its... | 1/1 | Individual | Emil Adolf von Behring | 1854-03-15 | Hansdorf (Lawice) | Prussia (Poland) | Poland | Male | Marburg University | Marburg | Germany | POL |
| 3 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Frédéric Passy | 1822-05-20 | Paris | France | France | Male | NaN | NaN | NaN | FRA |
| 4 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Jean Henry Dunant | 1828-05-08 | Geneva | Switzerland | Switzerland | Male | NaN | NaN | NaN | CHE |
df_data.shape
(962, 16)
df_data.columns
Index(['year', 'category', 'prize', 'motivation', 'prize_share',
'laureate_type', 'full_name', 'birth_date', 'birth_city',
'birth_country', 'birth_country_current', 'sex', 'organization_name',
'organization_city', 'organization_country', 'ISO'],
dtype='object')
min(df_data.year)
1901
max(df_data.year)
2020
df_data.duplicated().sum()
0
print(f'Data has NaN values: {df_data.isna().sum().any()}; total count: {df_data.isna().sum().sum()}')
Data has NaN values: True; total count: 1023
df_data.isna().sum()
year                       0
category                   0
prize                      0
motivation                88
prize_share                0
laureate_type              0
full_name                  0
birth_date                28
birth_city                31
birth_country             28
birth_country_current     28
sex                       28
organization_name        255
organization_city        255
organization_country     254
ISO                       28
dtype: int64
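The raw counts are easier to compare when shown next to the share of rows they represent. A minimal sketch, assuming a small hand-built frame (three columns of the five rows shown by head() above) instead of the full CSV; `missing_summary` is a hypothetical helper name, not part of pandas:

```python
import numpy as np
import pandas as pd

# Three columns of the five rows shown by df_data.head(5) above.
sample = pd.DataFrame({
    'full_name': ["Jacobus Henricus van 't Hoff", 'Sully Prudhomme',
                  'Emil Adolf von Behring', 'Frédéric Passy',
                  'Jean Henry Dunant'],
    'organization_name': ['Berlin University', np.nan,
                          'Marburg University', np.nan, np.nan],
    'ISO': ['NLD', 'FRA', 'POL', 'FRA', 'CHE'],
})


def missing_summary(df):
    """Return the NaN count and NaN share per column, largest first."""
    summary = pd.DataFrame({
        'n_missing': df.isna().sum(),       # absolute number of NaN cells
        'share_missing': df.isna().mean(),  # fraction of rows that are NaN
    })
    return summary.sort_values('n_missing', ascending=False)


summary = missing_summary(sample)
print(summary)
```

On the full dataset the same helper could be called as missing_summary(df_data).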
Convert the birth_date column to Pandas Datetime objects.
Add a column called share_pct which holds the laureates' share as a floating-point fraction (for example, 0.5 for a half share).
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   year                   962 non-null    int64
 1   category               962 non-null    object
 2   prize                  962 non-null    object
 3   motivation             874 non-null    object
 4   prize_share            962 non-null    object
 5   laureate_type          962 non-null    object
 6   full_name              962 non-null    object
 7   birth_date             934 non-null    object
 8   birth_city             931 non-null    object
 9   birth_country          934 non-null    object
 10  birth_country_current  934 non-null    object
 11  sex                    934 non-null    object
 12  organization_name      707 non-null    object
 13  organization_city      707 non-null    object
 14  organization_country   708 non-null    object
 15  ISO                    934 non-null    object
dtypes: int64(1), object(15)
memory usage: 120.4+ KB
df_data.birth_date = pd.to_datetime(df_data.birth_date)
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   year                   962 non-null    int64
 1   category               962 non-null    object
 2   prize                  962 non-null    object
 3   motivation             874 non-null    object
 4   prize_share            962 non-null    object
 5   laureate_type          962 non-null    object
 6   full_name              962 non-null    object
 7   birth_date             934 non-null    datetime64[ns]
 8   birth_city             931 non-null    object
 9   birth_country          934 non-null    object
 10  birth_country_current  934 non-null    object
 11  sex                    934 non-null    object
 12  organization_name      707 non-null    object
 13  organization_city      707 non-null    object
 14  organization_country   708 non-null    object
 15  ISO                    934 non-null    object
dtypes: datetime64[ns](1), int64(1), object(14)
memory usage: 120.4+ KB
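The non-null count for birth_date stays at 934 after the conversion because missing entries become NaT rather than raising an error. A small sketch of that behaviour, using two birth dates from the rows above and a None standing in for a laureate without a birth date:

```python
import pandas as pd

# Two birth dates copied from the head() output; None stands in for an
# organization, which has no birth date.
raw_dates = pd.Series(['1852-08-30', '1839-03-16', None])

parsed = pd.to_datetime(raw_dates)

print(parsed)
print('missing after parsing:', parsed.isna().sum())
# The .dt accessor is only available once the column holds datetimes.
print('first birth year:', parsed.dt.year.iloc[0])
```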
Filtering on the NaN values
col_subset = ['year','category', 'laureate_type',
'birth_date','full_name', 'organization_name']
df_data.loc[df_data.birth_date.isna()][col_subset]
| | year | category | laureate_type | birth_date | full_name | organization_name |
|---|---|---|---|---|---|---|
| 24 | 1904 | Peace | Organization | NaT | Institut de droit international (Institute of ... | NaN |
| 60 | 1910 | Peace | Organization | NaT | Bureau international permanent de la Paix (Per... | NaN |
| 89 | 1917 | Peace | Organization | NaT | Comité international de la Croix Rouge (Intern... | NaN |
| 200 | 1938 | Peace | Organization | NaT | Office international Nansen pour les Réfugiés ... | NaN |
| 215 | 1944 | Peace | Organization | NaT | Comité international de la Croix Rouge (Intern... | NaN |
| 237 | 1947 | Peace | Organization | NaT | American Friends Service Committee (The Quakers) | NaN |
| 238 | 1947 | Peace | Organization | NaT | Friends Service Council (The Quakers) | NaN |
| 283 | 1954 | Peace | Organization | NaT | Office of the United Nations High Commissioner... | NaN |
| 348 | 1963 | Peace | Organization | NaT | Comité international de la Croix Rouge (Intern... | NaN |
| 349 | 1963 | Peace | Organization | NaT | Ligue des Sociétés de la Croix-Rouge (League o... | NaN |
| 366 | 1965 | Peace | Organization | NaT | United Nations Children's Fund (UNICEF) | NaN |
| 399 | 1969 | Peace | Organization | NaT | International Labour Organization (I.L.O.) | NaN |
| 479 | 1977 | Peace | Organization | NaT | Amnesty International | NaN |
| 523 | 1981 | Peace | Organization | NaT | Office of the United Nations High Commissioner... | NaN |
| 558 | 1985 | Peace | Organization | NaT | International Physicians for the Prevention of... | NaN |
| 588 | 1988 | Peace | Organization | NaT | United Nations Peacekeeping Forces | NaN |
| 659 | 1995 | Peace | Organization | NaT | Pugwash Conferences on Science and World Affairs | NaN |
| 682 | 1997 | Peace | Organization | NaT | International Campaign to Ban Landmines (ICBL) | NaN |
| 703 | 1999 | Peace | Organization | NaT | Médecins Sans Frontières | NaN |
| 730 | 2001 | Peace | Organization | NaT | United Nations (U.N.) | NaN |
| 778 | 2005 | Peace | Organization | NaT | International Atomic Energy Agency (IAEA) | NaN |
| 788 | 2006 | Peace | Organization | NaT | Grameen Bank | NaN |
| 801 | 2007 | Peace | Organization | NaT | Intergovernmental Panel on Climate Change (IPCC) | NaN |
| 860 | 2012 | Peace | Organization | NaT | European Union (EU) | NaN |
| 873 | 2013 | Peace | Organization | NaT | Organisation for the Prohibition of Chemical W... | NaN |
| 897 | 2015 | Peace | Organization | NaT | National Dialogue Quartet | NaN |
| 919 | 2017 | Peace | Organization | NaT | International Campaign to Abolish Nuclear Weap... | NaN |
| 958 | 2020 | Peace | Organization | NaT | World Food Programme (WFP) | NaN |
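Every row in the table above is an organization, which suggests the missing birth dates are structural rather than data-entry gaps. A minimal sketch of how that could be checked programmatically, assuming a four-row stand-in built from names that appear in this notebook:

```python
import pandas as pd

# Two individuals and two organizations taken from the outputs above.
sample = pd.DataFrame({
    'full_name': ['Frédéric Passy', 'Amnesty International',
                  'Grameen Bank', 'Jean Henry Dunant'],
    'laureate_type': ['Individual', 'Organization',
                      'Organization', 'Individual'],
    'birth_date': pd.to_datetime(['1822-05-20', None, None, '1828-05-08']),
})

no_birth_date = sample[sample.birth_date.isna()]

# Which laureate types occur among the rows without a birth date?
print(no_birth_date.laureate_type.value_counts())

all_orgs = (no_birth_date.laureate_type == 'Organization').all()
print('only organizations lack a birth date:', all_orgs)
```

On the full data the same check would replace sample with df_data.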
Rows where the organization_name column has no value:
col_subset = ['year','category', 'laureate_type','full_name', 'organization_name']
df_data.loc[df_data.organization_name.isna()][col_subset]
| | year | category | laureate_type | full_name | organization_name |
|---|---|---|---|---|---|
| 1 | 1901 | Literature | Individual | Sully Prudhomme | NaN |
| 3 | 1901 | Peace | Individual | Frédéric Passy | NaN |
| 4 | 1901 | Peace | Individual | Jean Henry Dunant | NaN |
| 7 | 1902 | Literature | Individual | Christian Matthias Theodor Mommsen | NaN |
| 9 | 1902 | Peace | Individual | Charles Albert Gobat | NaN |
| ... | ... | ... | ... | ... | ... |
| 932 | 2018 | Peace | Individual | Nadia Murad | NaN |
| 942 | 2019 | Literature | Individual | Peter Handke | NaN |
| 946 | 2019 | Peace | Individual | Abiy Ahmed Ali | NaN |
| 954 | 2020 | Literature | Individual | Louise Glück | NaN |
| 958 | 2020 | Peace | Organization | World Food Programme (WFP) | NaN |
255 rows × 5 columns
# Split strings such as '1/2' into numerator and denominator columns.
separated_values = df_data.prize_share.str.split('/', expand=True)
numerator = pd.to_numeric(separated_values[0])
denominator = pd.to_numeric(separated_values[1])
df_data['share_pct'] = numerator / denominator
df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   year                   962 non-null    int64
 1   category               962 non-null    object
 2   prize                  962 non-null    object
 3   motivation             874 non-null    object
 4   prize_share            962 non-null    object
 5   laureate_type          962 non-null    object
 6   full_name              962 non-null    object
 7   birth_date             934 non-null    datetime64[ns]
 8   birth_city             931 non-null    object
 9   birth_country          934 non-null    object
 10  birth_country_current  934 non-null    object
 11  sex                    934 non-null    object
 12  organization_name      707 non-null    object
 13  organization_city      707 non-null    object
 14  organization_country   708 non-null    object
 15  ISO                    934 non-null    object
 16  share_pct              962 non-null    float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(14)
memory usage: 127.9+ KB
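The share strings can also be parsed with the standard library. This is an alternative sketch, not the approach used in this notebook: it relies on `fractions.Fraction` accepting strings of the form 'a/b', and uses the four share values that occur in the data (the describe() output further down shows a minimum of 0.25).

```python
from fractions import Fraction

import pandas as pd

# The distinct share strings seen in the prize_share column.
prize_share = pd.Series(['1/1', '1/2', '1/3', '1/4'])

# Fraction('1/3') parses the string exactly; float() turns it into a fraction of 1.
share_pct = prize_share.map(lambda s: float(Fraction(s)))

print(share_pct.tolist())
# Multiply by 100 if a true percentage is wanted instead of a fraction.
print((share_pct * 100).round(1).tolist())
```

The vectorised str.split version is likely faster on large frames; the Fraction version is mainly easier to read.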
Creating a donut chart using plotly which shows how many prizes went to men compared to how many prizes went to women. What percentage of all the prizes went to women?
biology = df_data.sex.value_counts()
# names sets the slice labels; the labels argument expects a dict of
# display names, so it is not used here.
fig = px.pie(names=biology.index,
             values=biology.values,
             title="Percentage of Male vs. Female Winners",
             hole=0.6)
fig.update_traces(textposition='inside', textfont_size=15, textinfo='percent')
fig.show()
Who were the first three women to win a Nobel Prize? What do you notice about their birth_country? Were they part of an organisation?
# df_data[df_data.sex == 'Female'].sort_values('year', ascending=True)[:3]
df_data.sort_values(by='year',ascending=True).query('sex=="Female"').head(3)[['full_name','prize','year','birth_country']]
| | full_name | prize | year | birth_country |
|---|---|---|---|---|
| 18 | Marie Curie, née Sklodowska | The Nobel Prize in Physics 1903 | 1903 | Russian Empire (Poland) |
| 29 | Baroness Bertha Sophie Felicita von Suttner, n... | The Nobel Peace Prize 1905 | 1905 | Austrian Empire (Czech Republic) |
| 51 | Selma Ottilia Lovisa Lagerlöf | The Nobel Prize in Literature 1909 | 1909 | Sweden |
Did some people get a Nobel Prize more than once? If so, who were they?
is_winner = df_data.duplicated(subset = ['full_name'] , keep=False)
multiple_winners = df_data[is_winner]
print(f'There are {multiple_winners.full_name.nunique()} winners who were awarded more than once.')
There are 6 winners who were awarded more than once.
col_subset = ['year', 'category', 'laureate_type', 'full_name']
multiple_winners[col_subset]
| | year | category | laureate_type | full_name |
|---|---|---|---|---|
| 18 | 1903 | Physics | Individual | Marie Curie, née Sklodowska |
| 62 | 1911 | Chemistry | Individual | Marie Curie, née Sklodowska |
| 89 | 1917 | Peace | Organization | Comité international de la Croix Rouge (Intern... |
| 215 | 1944 | Peace | Organization | Comité international de la Croix Rouge (Intern... |
| 278 | 1954 | Chemistry | Individual | Linus Carl Pauling |
| 283 | 1954 | Peace | Organization | Office of the United Nations High Commissioner... |
| 297 | 1956 | Physics | Individual | John Bardeen |
| 306 | 1958 | Chemistry | Individual | Frederick Sanger |
| 340 | 1962 | Peace | Individual | Linus Carl Pauling |
| 348 | 1963 | Peace | Organization | Comité international de la Croix Rouge (Intern... |
| 424 | 1972 | Physics | Individual | John Bardeen |
| 505 | 1980 | Chemistry | Individual | Frederick Sanger |
| 523 | 1981 | Peace | Organization | Office of the United Nations High Commissioner... |
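The duplicated() filter finds repeat winners but does not say whether the prizes were in the same field. A minimal sketch of a groupby-based summary, using only the individual laureates from the table above (the organization names are truncated in the output, so they are left out here):

```python
import pandas as pd

# Individual repeat winners copied from the table above.
repeat = pd.DataFrame({
    'year': [1903, 1911, 1954, 1956, 1958, 1962, 1972, 1980],
    'category': ['Physics', 'Chemistry', 'Chemistry', 'Physics',
                 'Chemistry', 'Peace', 'Physics', 'Chemistry'],
    'full_name': ['Marie Curie, née Sklodowska', 'Marie Curie, née Sklodowska',
                  'Linus Carl Pauling', 'John Bardeen', 'Frederick Sanger',
                  'Linus Carl Pauling', 'John Bardeen', 'Frederick Sanger'],
})

# One row per person: number of prizes, distinct categories, first and last year.
per_person = repeat.groupby('full_name').agg(
    n_prizes=('year', 'count'),
    n_categories=('category', 'nunique'),
    first_year=('year', 'min'),
    last_year=('year', 'max'),
)
print(per_person)

# People whose prizes span more than one category.
cross_category = per_person[per_person.n_categories > 1].index.tolist()
print(cross_category)
```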
df_data.category.nunique()
6
categoryData = df_data.category.value_counts()
fig = px.bar(x=categoryData.index,
             y=categoryData.values,
             color=categoryData.values,
             color_continuous_scale='Aggrnyl',
             title='Number of Prizes per Category')
fig.update_layout(xaxis_title='Nobel Prize Category',
                  yaxis_title='Number of Prizes',
                  coloraxis_showscale=False)
fig.show()
df_data[df_data.category=='Economics'].sort_values(by='year' , ascending=True)[:3]
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 393 | 1969 | Economics | The Sveriges Riksbank Prize in Economic Scienc... | "for having developed and applied dynamic mode... | 1/2 | Individual | Jan Tinbergen | 1903-04-12 | the Hague | Netherlands | Netherlands | Male | The Netherlands School of Economics | Rotterdam | Netherlands | NLD | 0.50 |
| 394 | 1969 | Economics | The Sveriges Riksbank Prize in Economic Scienc... | "for having developed and applied dynamic mode... | 1/2 | Individual | Ragnar Frisch | 1895-03-03 | Oslo | Norway | Norway | Male | University of Oslo | Oslo | Norway | NOR | 0.50 |
| 402 | 1970 | Economics | The Sveriges Riksbank Prize in Economic Scienc... | "for the scientific work through which he has ... | 1/1 | Individual | Paul A. Samuelson | 1915-05-15 | Gary, IN | United States of America | United States of America | Male | Massachusetts Institute of Technology (MIT) | Cambridge, MA | United States of America | USA | 1.00 |
Creating a plotly bar chart that shows the split between men and women by category.
cat_men_women = df_data.groupby(['category','sex'] ,
as_index=False).agg({'prize':pd.Series.count})
cat_men_women.sort_values('prize',ascending =False , inplace = True)
cat_men_women
| | category | sex | prize |
|---|---|---|---|
| 11 | Physics | Male | 212 |
| 7 | Medicine | Male | 210 |
| 1 | Chemistry | Male | 179 |
| 5 | Literature | Male | 101 |
| 9 | Peace | Male | 90 |
| 3 | Economics | Male | 84 |
| 8 | Peace | Female | 17 |
| 4 | Literature | Female | 16 |
| 6 | Medicine | Female | 12 |
| 0 | Chemistry | Female | 7 |
| 10 | Physics | Female | 4 |
| 2 | Economics | Female | 2 |
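The counts in this table are enough to put a number on the share of prizes that went to women, both overall and per category. A short sketch that re-enters the twelve counts from the table above and pivots them into one row per category:

```python
import pandas as pd

# Counts copied from the cat_men_women table above.
cat_men_women = pd.DataFrame({
    'category': ['Physics', 'Medicine', 'Chemistry', 'Literature', 'Peace',
                 'Economics', 'Peace', 'Literature', 'Medicine', 'Chemistry',
                 'Physics', 'Economics'],
    'sex': ['Male'] * 6 + ['Female'] * 6,
    'prize': [212, 210, 179, 101, 90, 84, 17, 16, 12, 7, 4, 2],
})

# One row per category with a Female and a Male column.
counts = cat_men_women.pivot(index='category', columns='sex', values='prize')
counts['female_share'] = counts['Female'] / (counts['Female'] + counts['Male'])

total_female = counts['Female'].sum()
total_male = counts['Male'].sum()
overall_female_share = total_female / (total_female + total_male)

print(counts.sort_values('female_share', ascending=False))
print(f'Overall share of prizes awarded to women: {overall_female_share:.1%}')
```

Organizations have no value in the sex column, so they are excluded from these totals.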
fig = px.bar(cat_men_women,y = 'prize' ,
x = 'category',
color ='sex',
color_continuous_scale='Aggrnyl',
title='Number of Prizes Awarded per Category split by Men and Women')
fig.update_layout(xaxis_title = 'Nobel Prize Category',
yaxis_title = 'number of prizes',
coloraxis_showscale = False )
fig.show()
Are more prizes awarded recently than when the prize was first created? Show the trend in awards visually.
Also plot a 5-year rolling average of the number of prizes.
Looking at the chart, did the first and second world wars have an impact on the number of prizes being given out?
prize_per_year = df_data.groupby(['year'] ,
as_index=False).agg({'prize':pd.Series.count})
prize_per_year
| | year | prize |
|---|---|---|
| 0 | 1901 | 6 |
| 1 | 1902 | 7 |
| 2 | 1903 | 7 |
| 3 | 1904 | 6 |
| 4 | 1905 | 5 |
| ... | ... | ... |
| 112 | 2016 | 11 |
| 113 | 2017 | 12 |
| 114 | 2018 | 13 |
| 115 | 2019 | 14 |
| 116 | 2020 | 12 |
117 rows × 2 columns
fig = px.scatter(prize_per_year,y = 'prize' ,
x = 'year')
fig.update_layout(xaxis_title = 'year',
yaxis_title = 'number of prizes',
)
fig.show()
prize_per_year = df_data.groupby(by='year').count().prize
moving_average = prize_per_year.rolling(window=5).mean()
np.arange(1900, 2021, step=5)
plt.figure(figsize=(16,8), dpi=200)
plt.title('Number of Nobel Prizes Awarded per Year', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(ticks=np.arange(1900, 2021, step=5),
fontsize=14,
rotation=45)
ax = plt.gca() # get current axis
ax.set_xlim(1900, 2020)
plt.scatter(x=prize_per_year.index,
y=prize_per_year.values,
c='dodgerblue',
alpha=0.7,
s=100,)
plt.plot(prize_per_year.index,
moving_average.values,
c='crimson',
linewidth=3,)
plt.show()
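The red line starts a few years after the first scatter point because rolling(window=5) needs five observations before it returns a value. A minimal sketch using the 1901 to 1905 counts from the table above; the `min_periods` variant is shown only as an option, it is not what the chart uses:

```python
import pandas as pd

# Prizes per year for 1901-1905, copied from the prize_per_year output above.
prize_per_year = pd.Series([6, 7, 7, 6, 5],
                           index=[1901, 1902, 1903, 1904, 1905],
                           name='prize')

# Default behaviour: NaN until a full window of 5 years is available.
strict = prize_per_year.rolling(window=5).mean()

# With min_periods=1 the average is computed over whatever is available so far.
lenient = prize_per_year.rolling(window=5, min_periods=1).mean()

print(pd.DataFrame({'prize': prize_per_year,
                    'strict': strict,
                    'lenient': lenient}))
```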
Investigating if more prizes are shared than before.
yearly_avg_share = df_data.groupby(by='year').agg({'share_pct': pd.Series.mean})
share_moving_average = yearly_avg_share.rolling(window=5).mean()
plt.figure(figsize=(16,8), dpi=200)
plt.title('Number of Nobel Prizes Awarded per Year', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(ticks=np.arange(1900, 2021, step=5),
fontsize=14,
rotation=45)
ax1 = plt.gca()
ax2 = ax1.twinx() # create second y-axis
ax1.set_xlim(1900, 2020)
ax1.scatter(x=prize_per_year.index,
y=prize_per_year.values,
c='dodgerblue',
alpha=0.7,
s=100,)
ax1.plot(prize_per_year.index,
moving_average.values,
c='crimson',
linewidth=3,)
# Adding prize share plot on second axis
ax2.plot(prize_per_year.index,
share_moving_average.values,
c='grey',
linewidth=3,)
plt.show()
plt.figure(figsize=(16,8), dpi=200)
plt.title('Number of Nobel Prizes Awarded per Year', fontsize=18)
plt.yticks(fontsize=14)
plt.xticks(ticks=np.arange(1900, 2021, step=5),
fontsize=14,
rotation=45)
ax1 = plt.gca()
ax2 = ax1.twinx()
ax1.set_xlim(1900, 2020)
# Can invert axis
ax2.invert_yaxis()
ax1.scatter(x=prize_per_year.index,
y=prize_per_year.values,
c='dodgerblue',
alpha=0.7,
s=100,)
ax1.plot(prize_per_year.index,
moving_average.values,
c='crimson',
linewidth=3,)
ax2.plot(prize_per_year.index,
share_moving_average.values,
c='grey',
linewidth=3,)
plt.show()
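Inverting the second axis makes sense because the average share falls as the number of laureates per prize rises: when the shares of one prize sum to 1, their mean is 1 divided by the number of laureates. A worked sketch using the three 2020 Physics laureates shown in the data further down:

```python
import pandas as pd

# The 2020 Physics prize: shares of 1/4, 1/4 and 1/2.
physics_2020 = pd.DataFrame({
    'full_name': ['Andrea Ghez', 'Reinhard Genzel', 'Roger Penrose'],
    'share_pct': [1 / 4, 1 / 4, 1 / 2],
})

n_laureates = len(physics_2020)
mean_share = physics_2020.share_pct.mean()

print('laureates:', n_laureates)
print('sum of shares:', physics_2020.share_pct.sum())
print('mean share:', round(mean_share, 4))
print('1 / laureates:', round(1 / n_laureates, 4))
```

A lower mean share therefore corresponds to more people sharing each prize, so the grey line moves with the red one only after the axis is flipped.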
Create a pandas DataFrame called top20_countries that has two columns. The prize column contains the total number of prizes won.
What is the ranking for the top 20 countries in terms of the number of prizes?
top20_countries = df_data.groupby(by = 'birth_country_current',
as_index=False).agg({'prize':pd.Series.count})
# top20_countries
top20_countries = top20_countries.sort_values('prize')[-20:]#to plot correctly
fig = px.bar(top20_countries,y = 'birth_country_current' ,
x = 'prize',orientation='h',
color ='prize',
color_continuous_scale='Aggrnyl',
title='Top 20 Countries by Number of Prizes')
fig.update_layout(xaxis_title = 'Number of Prizes',
yaxis_title = 'Country',
coloraxis_showscale = False )
fig.show()
df_countries = df_data.groupby(['birth_country_current', 'ISO'],
as_index=False).agg({'prize': pd.Series.count})
df_countries.sort_values('prize', ascending=False)
| | birth_country_current | ISO | prize |
|---|---|---|---|
| 74 | United States of America | USA | 281 |
| 73 | United Kingdom | GBR | 105 |
| 26 | Germany | DEU | 84 |
| 25 | France | FRA | 57 |
| 67 | Sweden | SWE | 29 |
| ... | ... | ... | ... |
| 32 | Iceland | ISL | 1 |
| 47 | Madagascar | MDG | 1 |
| 34 | Indonesia | IDN | 1 |
| 36 | Iraq | IRQ | 1 |
| 78 | Zimbabwe | ZWE | 1 |
79 rows × 3 columns
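For the horizontal bar chart, the rows are sorted ascending so that the largest bar ends up at the top. A minimal sketch of the same selection with `nlargest`, assuming a seven-row stand-in built from the counts in the table above (and a top 3 instead of a top 20 to keep it short):

```python
import pandas as pd

# Seven of the 79 rows from the df_countries output above.
df_countries = pd.DataFrame({
    'birth_country_current': ['United States of America', 'United Kingdom',
                              'Germany', 'France', 'Sweden',
                              'Iceland', 'Zimbabwe'],
    'ISO': ['USA', 'GBR', 'DEU', 'FRA', 'SWE', 'ISL', 'ZWE'],
    'prize': [281, 105, 84, 57, 29, 1, 1],
})

# Pick the top rows first, then sort ascending: a horizontal bar chart draws
# the first row at the bottom, so the largest value ends up on top.
top3 = df_countries.nlargest(3, 'prize').sort_values('prize')
print(top3)
```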
fig = px.choropleth(df_countries, locations='ISO', color='prize',
color_continuous_scale=px.colors.sequential.matter,
range_color=(0, 250),
hover_name='birth_country_current',
)
# fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
Divide up the plotly bar chart created above to show which categories made up the total number of prizes.
df_categories = df_data.groupby(['birth_country_current', 'category'],
as_index=False).agg({'prize': pd.Series.count })
df_categories.sort_values('prize', ascending=False)
| | birth_country_current | category | prize |
|---|---|---|---|
| 204 | United States of America | Medicine | 78 |
| 206 | United States of America | Physics | 70 |
| 201 | United States of America | Chemistry | 55 |
| 202 | United States of America | Economics | 49 |
| 198 | United Kingdom | Medicine | 28 |
| ... | ... | ... | ... |
| 97 | Iraq | Peace | 1 |
| 99 | Ireland | Medicine | 1 |
| 100 | Ireland | Physics | 1 |
| 102 | Israel | Economics | 1 |
| 210 | Zimbabwe | Peace | 1 |
211 rows × 3 columns
merged_df = pd.merge(df_categories , top20_countries , on='birth_country_current')
merged_df.columns = ['birth_country_current' , 'category','cat_prize','total_prize']
merged_df.sort_values(by='total_prize' , inplace=True)
cat_cntry_bar = px.bar(x=merged_df.cat_prize,
y=merged_df.birth_country_current,
color=merged_df.category,
orientation='h',
title='Top 20 Countries by Number of Prizes and Category')
cat_cntry_bar.update_layout(xaxis_title='Number of Prizes',
yaxis_title='Country')
cat_cntry_bar.show()
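Both frames in the merge carry a column named prize, which is why the columns are renamed afterwards. As an alternative sketch (not the approach used above), the `suffixes` argument of `pd.merge` can name the two columns directly; the numbers are the five largest category counts and the two country totals shown earlier:

```python
import pandas as pd

# Per-category counts, copied from the df_categories output above.
df_categories = pd.DataFrame({
    'birth_country_current': ['United States of America'] * 4 + ['United Kingdom'],
    'category': ['Medicine', 'Physics', 'Chemistry', 'Economics', 'Medicine'],
    'prize': [78, 70, 55, 49, 28],
})

# Country totals, copied from the df_countries output above.
totals = pd.DataFrame({
    'birth_country_current': ['United States of America', 'United Kingdom'],
    'prize': [281, 105],
})

# The overlapping 'prize' columns get the given suffixes instead of _x and _y.
merged = pd.merge(df_categories, totals, on='birth_country_current',
                  suffixes=('_category', '_total'))
merged['share_of_total'] = merged.prize_category / merged.prize_total
print(merged)
```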
prize_by_year = df_data.groupby(by=['birth_country_current', 'year'], as_index=False).count()
prize_by_year = prize_by_year.sort_values('year')[['year', 'birth_country_current', 'prize']]
cumulative_prizes = prize_by_year.groupby(by=['birth_country_current',
'year']).sum().groupby(level=[0]).cumsum()
cumulative_prizes.reset_index(inplace=True)
cumulative_prizes
| | birth_country_current | year | prize |
|---|---|---|---|
| 0 | Algeria | 1957 | 1 |
| 1 | Algeria | 1997 | 2 |
| 2 | Argentina | 1936 | 1 |
| 3 | Argentina | 1947 | 2 |
| 4 | Argentina | 1980 | 3 |
| ... | ... | ... | ... |
| 622 | United States of America | 2020 | 281 |
| 623 | Venezuela | 1980 | 1 |
| 624 | Vietnam | 1973 | 1 |
| 625 | Yemen | 2011 | 1 |
| 626 | Zimbabwe | 1960 | 1 |
627 rows × 3 columns
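The running totals come from a cumulative sum that restarts for each country. A minimal sketch of that step on the Algeria and Argentina rows from the table above, entered here as one prize per year and deliberately out of order to show why sorting comes first:

```python
import pandas as pd

# Yearly prize counts for two countries, in shuffled order.
prize_by_year = pd.DataFrame({
    'birth_country_current': ['Argentina', 'Algeria', 'Argentina',
                              'Algeria', 'Argentina'],
    'year': [1936, 1957, 1947, 1997, 1980],
    'prize': [1, 1, 1, 1, 1],
})

# Sort by country and year so the running total follows time within a country.
cumulative = prize_by_year.sort_values(['birth_country_current', 'year']).copy()

# groupby(...).cumsum() restarts the running total for every country.
cumulative['cumulative_prize'] = (
    cumulative.groupby('birth_country_current').prize.cumsum()
)
cumulative = cumulative.reset_index(drop=True)
print(cumulative)
```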
l_chart = px.line(cumulative_prizes,
x='year',
y='prize',
color='birth_country_current',
hover_name='birth_country_current')
l_chart.update_layout(xaxis_title='Year',
yaxis_title='Number of Prizes')
l_chart.show()
Creating a bar chart showing the organisations affiliated with the Nobel laureates.
top20_org = df_data.groupby(by = 'organization_name',
as_index=False).agg({'prize':pd.Series.count})
# # top20_countries
top20_org = top20_org.sort_values('prize')[-20:]#to plot correctly
fig = px.bar(top20_org,y = 'organization_name' ,
x = 'prize',orientation='h',
color ='prize',
color_continuous_scale='Aggrnyl',
title='Top 20 Organizations by Number of Prizes')
fig.update_layout(xaxis_title = 'Number of Prizes',
yaxis_title = 'Organization',
coloraxis_showscale = False )
fig.show()
Where do major discoveries take place?
top20_org_city = df_data.groupby(by = 'organization_city',
as_index=False).agg({'prize':pd.Series.count})
# # top20_countries
top20_org_city = top20_org_city.sort_values('prize')[-20:]#to plot correctly
fig = px.bar(top20_org_city,y = 'organization_city' ,
x = 'prize',orientation='h',
color ='prize',
color_continuous_scale='Aggrnyl',
title='Top 20 Organization Cities by Number of Prizes')
fig.update_layout(xaxis_title = 'Number of Prizes',
yaxis_title = 'Organization City',
coloraxis_showscale = False )
fig.show()
df_data.head()
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Netherlands | Male | Berlin University | Berlin | Germany | NLD | 1.00 |
| 1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 1.00 |
| 2 | 1901 | Medicine | The Nobel Prize in Physiology or Medicine 1901 | "for his work on serum therapy, especially its... | 1/1 | Individual | Emil Adolf von Behring | 1854-03-15 | Hansdorf (Lawice) | Prussia (Poland) | Poland | Male | Marburg University | Marburg | Germany | POL | 1.00 |
| 3 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Frédéric Passy | 1822-05-20 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 0.50 |
| 4 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Jean Henry Dunant | 1828-05-08 | Geneva | Switzerland | Switzerland | Male | NaN | NaN | NaN | CHE | 0.50 |
top20_birth_city = df_data.groupby(by = 'birth_city',
as_index=False).agg({'prize':pd.Series.count})
# # top20_countries
top20_birth_city = top20_birth_city.sort_values('prize')[-20:]#to plot correctly
fig = px.bar(top20_birth_city,y = 'birth_city' ,
x = 'prize',orientation='h',
color ='prize',
color_continuous_scale='Plasma',
title='Top 20 Birth Cities by Number of Prizes')
fig.update_layout(xaxis_title = 'Number of Prizes',
yaxis_title = 'Birth City',
coloraxis_showscale = False )
fig.show()
prizes_Per_org_cntry = df_data.groupby(by=['organization_country',
'organization_city',
'organization_name'],
as_index=False).agg({'prize':pd.Series.count})
prizes_Per_org_cntry = prizes_Per_org_cntry.sort_values('prize',ascending=False)#to plot correctly
prizes_Per_org_cntry
| | organization_country | organization_city | organization_name | prize |
|---|---|---|---|---|
| 205 | United States of America | Cambridge, MA | Harvard University | 29 |
| 280 | United States of America | Stanford, CA | Stanford University | 23 |
| 206 | United States of America | Cambridge, MA | Massachusetts Institute of Technology (MIT) | 21 |
| 209 | United States of America | Chicago, IL | University of Chicago | 20 |
| 195 | United States of America | Berkeley, CA | University of California | 19 |
| ... | ... | ... | ... | ... |
| 110 | Japan | Sapporo | Hokkaido University | 1 |
| 111 | Japan | Tokyo | Asahi Kasei Corporation | 1 |
| 112 | Japan | Tokyo | Kitasato University | 1 |
| 113 | Japan | Tokyo | Tokyo Institute of Technology | 1 |
| 290 | United States of America | Yorktown Heights, NY | IBM Thomas J. Watson Research Center | 1 |
291 rows × 4 columns
fig = px.sunburst(prizes_Per_org_cntry, path=['organization_country',
'organization_city',
'organization_name'],
values='prize',
color='prize', hover_data=['prize'],
color_continuous_scale='RdBu',
color_continuous_midpoint=np.average(prizes_Per_org_cntry['prize'],
weights=prizes_Per_org_cntry['prize']))
fig.show()
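The colour midpoint in the sunburst is a mean of the prize counts weighted by the prize counts themselves, which works out to sum(p²) / sum(p) and sits above the plain mean. A worked sketch on the five largest counts from the table above only; on the full data, with its many single-prize organizations, both numbers would be much lower:

```python
import numpy as np

# The five largest prize counts from the prizes_Per_org_cntry output above.
prizes = np.array([29, 23, 21, 20, 19])

plain_mean = prizes.mean()

# Each value is weighted by itself, so large organizations pull the
# midpoint upwards.
weighted_mean = np.average(prizes, weights=prizes)

# The same quantity written out explicitly.
by_hand = (prizes ** 2).sum() / prizes.sum()

print('plain mean:   ', round(plain_mean, 3))
print('weighted mean:', round(weighted_mean, 3))
print('sum(p^2)/sum(p):', round(by_hand, 3))
```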
How Old Are the Laureates When They Win the Prize?
# Age at award is approximated as award year minus birth year.
birth_years = df_data.birth_date.dt.year
df_data['winning_age'] = df_data.year - birth_years
df_data
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct | winning_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Netherlands | Male | Berlin University | Berlin | Germany | NLD | 1.00 | 49.00 |
| 1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 1.00 | 62.00 |
| 2 | 1901 | Medicine | The Nobel Prize in Physiology or Medicine 1901 | "for his work on serum therapy, especially its... | 1/1 | Individual | Emil Adolf von Behring | 1854-03-15 | Hansdorf (Lawice) | Prussia (Poland) | Poland | Male | Marburg University | Marburg | Germany | POL | 1.00 | 47.00 |
| 3 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Frédéric Passy | 1822-05-20 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 0.50 | 79.00 |
| 4 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Jean Henry Dunant | 1828-05-08 | Geneva | Switzerland | Switzerland | Male | NaN | NaN | NaN | CHE | 0.50 | 73.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 957 | 2020 | Medicine | The Nobel Prize in Physiology or Medicine 2020 | “for the discovery of Hepatitis C virus” | 1/3 | Individual | Michael Houghton | 1949-07-02 | NaN | United Kingdom | United Kingdom | Male | University of Alberta | Edmonton | Canada | GBR | 0.33 | 71.00 |
| 958 | 2020 | Peace | The Nobel Peace Prize 2020 | “for its efforts to combat hunger, for its con... | 1/1 | Organization | World Food Programme (WFP) | NaT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.00 | NaN |
| 959 | 2020 | Physics | The Nobel Prize in Physics 2020 | “for the discovery of a supermassive compact o... | 1/4 | Individual | Andrea Ghez | 1965-06-16 | New York, NY | United States of America | United States of America | Female | University of California | Berkeley, CA | United States of America | USA | 0.25 | 55.00 |
| 960 | 2020 | Physics | The Nobel Prize in Physics 2020 | “for the discovery of a supermassive compact o... | 1/4 | Individual | Reinhard Genzel | 1952-03-24 | Bad Homburg vor der Höhe | Germany | Germany | Male | University of California | Los Angeles, CA | United States of America | DEU | 0.25 | 68.00 |
| 961 | 2020 | Physics | The Nobel Prize in Physics 2020 | “for the discovery that black hole formation i... | 1/2 | Individual | Roger Penrose | 1931-08-08 | Colchester | United Kingdom | United Kingdom | Male | University of Oxford | Oxford | United Kingdom | GBR | 0.50 | 89.00 |
962 rows × 18 columns
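The winning_age column shows up as a float (49.00) because organizations have no birth date, and a missing value forces the whole column to floating point. A minimal sketch of the year-difference calculation on three rows from the output above, using `.dt.year`; the conversion to the nullable `Int64` dtype is an optional extra, not something the notebook does:

```python
import pandas as pd

# Three rows from the output above: two individuals and one organization.
sample = pd.DataFrame({
    'year': [1901, 2020, 2020],
    'full_name': ["Jacobus Henricus van 't Hoff",
                  'World Food Programme (WFP)',
                  'Roger Penrose'],
    'birth_date': pd.to_datetime(['1852-08-30', None, '1931-08-08']),
})

# Award year minus birth year; NaT birth dates propagate as NaN.
sample['winning_age'] = sample.year - sample.birth_date.dt.year
print(sample[['full_name', 'winning_age']])

# Optional: a nullable integer dtype keeps whole numbers and marks gaps as <NA>.
age_int = sample.winning_age.astype('Int64')
print(age_int)
```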
df_data.describe()
| | year | share_pct | winning_age |
|---|---|---|---|
| count | 962.00 | 962.00 | 934.00 |
| mean | 1,971.82 | 0.63 | 59.95 |
| std | 33.81 | 0.29 | 12.62 |
| min | 1,901.00 | 0.25 | 17.00 |
| 25% | 1,948.00 | 0.33 | 51.00 |
| 50% | 1,977.00 | 0.50 | 60.00 |
| 75% | 2,001.00 | 1.00 | 69.00 |
| max | 2,020.00 | 1.00 | 97.00 |
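The count of 934 for winning_age against 962 rows means 28 ages are missing. Below is a minimal sketch of how one might check which rows those are, assuming (as the World Food Programme row suggests) that they are mostly organizations without a birth date. The helper name `missing_age_summary` and the toy frame are hypothetical stand-ins for `df_data`, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_age_summary(df):
    """Count rows without a winning_age, split by laureate_type."""
    missing = df[df['winning_age'].isna()]
    return missing['laureate_type'].value_counts()

# Toy data standing in for df_data (values are illustrative only)
toy = pd.DataFrame({
    'laureate_type': ['Individual', 'Organization', 'Individual', 'Organization'],
    'winning_age': [49.0, np.nan, 62.0, np.nan],
})
summary = missing_age_summary(toy)
print(summary)
```

On the real data this would be called as `missing_age_summary(df_data)`.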
print('oldest winner')
display(df_data.nlargest(n=1, columns='winning_age'))
print('youngest winner')
display(df_data.nsmallest(n=1, columns='winning_age'))
oldest winner
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct | winning_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 937 | 2019 | Chemistry | The Nobel Prize in Chemistry 2019 | “for the development of lithium-ion batteries” | 1/3 | Individual | John Goodenough | 1922-07-25 | Jena | Germany | Germany | Male | University of Texas | Austin TX | United States of America | DEU | 0.33 | 97.00 |
youngest winner
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct | winning_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 885 | 2014 | Peace | The Nobel Peace Prize 2014 | "for their struggle against the suppression of... | 1/2 | Individual | Malala Yousafzai | 1997-07-12 | Mingora | Pakistan | Pakistan | Female | NaN | NaN | NaN | PAK | 0.50 | 17.00 |
plt.figure(figsize=(8, 4), dpi=200)
sns.histplot(data=df_data,
             x=df_data.winning_age,
             bins=30)
plt.xlabel('Age')
plt.title('Distribution of Age on Receipt of Prize')
plt.show()
Are Nobel laureates being nominated later in life than before? Have the ages of laureates at the time of the award increased or decreased over time?
The histogram above shows us the distribution across the entire dataset, over the entire time period. But perhaps the age has changed over time.
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style("whitegrid"):
    sns.regplot(data=df_data,
                x='year',
                y='winning_age',
                lowess=True,
                scatter_kws={'alpha': 0.4},
                line_kws={'color': 'black'})
plt.show()
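To put numbers next to the trend line, one option is to average the winning age per decade. This is only a sketch under assumptions: the helper `mean_age_by_decade` and the toy frame are hypothetical, and the decade bucketing is a choice made here for illustration, not something the notebook does.

```python
import numpy as np
import pandas as pd

def mean_age_by_decade(df):
    """Average winning_age per decade; missing ages are skipped by mean()."""
    decade = (df['year'] // 10) * 10  # e.g. 1905 -> 1900
    return df.groupby(decade)['winning_age'].mean()

# Toy data standing in for df_data (values are illustrative only)
toy = pd.DataFrame({
    'year': [1901, 1905, 1912, 2019, 2020],
    'winning_age': [49.0, 61.0, 50.0, 97.0, np.nan],
})
by_decade = mean_age_by_decade(toy)
print(by_decade)
```

With the real data, `mean_age_by_decade(df_data)` should give a table that can be read alongside the LOWESS curve.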
How does the age of laureates vary by category?
df_data.head()
| | year | category | prize | motivation | prize_share | laureate_type | full_name | birth_date | birth_city | birth_country | birth_country_current | sex | organization_name | organization_city | organization_country | ISO | share_pct | winning_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Chemistry | The Nobel Prize in Chemistry 1901 | "in recognition of the extraordinary services ... | 1/1 | Individual | Jacobus Henricus van 't Hoff | 1852-08-30 | Rotterdam | Netherlands | Netherlands | Male | Berlin University | Berlin | Germany | NLD | 1.00 | 49.00 |
| 1 | 1901 | Literature | The Nobel Prize in Literature 1901 | "in special recognition of his poetic composit... | 1/1 | Individual | Sully Prudhomme | 1839-03-16 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 1.00 | 62.00 |
| 2 | 1901 | Medicine | The Nobel Prize in Physiology or Medicine 1901 | "for his work on serum therapy, especially its... | 1/1 | Individual | Emil Adolf von Behring | 1854-03-15 | Hansdorf (Lawice) | Prussia (Poland) | Poland | Male | Marburg University | Marburg | Germany | POL | 1.00 | 47.00 |
| 3 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Frédéric Passy | 1822-05-20 | Paris | France | France | Male | NaN | NaN | NaN | FRA | 0.50 | 79.00 |
| 4 | 1901 | Peace | The Nobel Peace Prize 1901 | NaN | 1/2 | Individual | Jean Henry Dunant | 1828-05-08 | Geneva | Switzerland | Switzerland | Male | NaN | NaN | NaN | CHE | 0.50 | 73.00 |
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style("whitegrid"):
    sns.boxplot(data=df_data,
                x='category',
                y='winning_age')
plt.show()
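The centre line of each box is the median, so the same comparison can be read off as a small table. A minimal sketch, assuming the `category` and `winning_age` columns shown above; the helper `median_age_by_category` and the toy values are hypothetical.

```python
import pandas as pd

def median_age_by_category(df):
    """Median winning_age per category, sorted from youngest to oldest."""
    return df.groupby('category')['winning_age'].median().sort_values()

# Toy data standing in for df_data (values are illustrative only)
toy = pd.DataFrame({
    'category': ['Physics', 'Physics', 'Physics',
                 'Literature', 'Literature', 'Peace'],
    'winning_age': [55.0, 68.0, 89.0, 62.0, 70.0, 17.0],
})
medians = median_age_by_category(toy)
print(medians)
```

Calling `median_age_by_category(df_data)` on the real data would list the categories in the same order as their box medians.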
Is the .lmplot() with one chart per category (the row parameter) telling a different story from the .boxplot()? Next, use .lmplot() to put all 6 categories on the same chart using the hue parameter.
with sns.axes_style('whitegrid'):
    sns.lmplot(data=df_data,
               x='year',
               y='winning_age',
               row='category',
               lowess=True,
               aspect=2,
               scatter_kws={'alpha': 0.6},
               line_kws={'color': 'black'})
plt.show()
Combining all of these charts into a single chart:
with sns.axes_style("whitegrid"):
    sns.lmplot(data=df_data,
               x='year',
               y='winning_age',
               hue='category',
               lowess=True,
               aspect=2,
               scatter_kws={'alpha': 0.5},
               line_kws={'linewidth': 5})
plt.show()
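To compare the direction of the trends numerically, one rough option is a straight-line fit per category. Note that this swaps in a linear fit via `np.polyfit` for the LOWESS smoother used in the plots, so it only captures the overall direction, not the bends in the curves. The helper `age_trend_per_category` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def age_trend_per_category(df):
    """Slope of a linear fit of winning_age on year, per category.

    The slope is in years of age per calendar year; positive means
    laureates in that category tend to be older in later years.
    """
    clean = df.dropna(subset=['year', 'winning_age'])
    slopes = {}
    for category, group in clean.groupby('category'):
        slope, _intercept = np.polyfit(group['year'], group['winning_age'], deg=1)
        slopes[category] = slope
    return pd.Series(slopes).sort_values()

# Toy data standing in for df_data (values are illustrative only)
toy = pd.DataFrame({
    'category': ['Physics'] * 3 + ['Peace'] * 3,
    'year': [1900, 1950, 2000, 1900, 1950, 2000],
    'winning_age': [40.0, 50.0, 60.0, 70.0, 65.0, 60.0],
})
trends = age_trend_per_category(toy)
print(trends)
```

On the real data, `age_trend_per_category(df_data)` would give one slope per prize category to set against the coloured lines in the chart.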